@maleadt maleadt commented Jan 12, 2026

Extends #2937, fixes #2946

Comment on lines +369 to +373
```julia
# `$T` is interpolated by an enclosing `for T in (...)` loop over the float types
# (the loop header falls outside this diff excerpt).
@eval begin
    @device_override @inline Base.FastMath.max_fast(x::$T, y::$T) = ifelse(y > x, y, x)
    @device_override @inline Base.FastMath.min_fast(x::$T, y::$T) = ifelse(y > x, x, y)
    @device_override @inline Base.FastMath.minmax_fast(x::$T, y::$T) = ifelse(y > x, (x, y), (y, x))
end
```
Contributor
Can you do something like

```julia
@device_override @inline Base.FastMath.max_fast(x::T, y::T) where {T<:Union{Float16, Float32, Float64}} = ifelse(y > x, y, x)
```

just to avoid the loop?
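The loop-free form being suggested can be sketched as follows, with plain functions standing in for the `@device_override` definitions (so no CUDA.jl or GPUCompiler machinery is needed to try it); `FastFloat` is a name introduced here for illustration:

```julia
# A single `where`-parameterized method per operation covers all three float
# types at once, replacing the metaprogrammed loop over types.
const FastFloat = Union{Float16, Float32, Float64}

max_fast(x::T, y::T) where {T<:FastFloat} = ifelse(y > x, y, x)
min_fast(x::T, y::T) where {T<:FastFloat} = ifelse(y > x, x, y)
minmax_fast(x::T, y::T) where {T<:FastFloat} = ifelse(y > x, (x, y), (y, x))
```

The branchless `ifelse` form is what lets these lower to hardware min/max instructions instead of control flow.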

Member Author

I'm always wary of doing so, because the Base method may then end up being more specific (and we really want these overrides to apply). In this case, Base doesn't use metaprogramming, so I guess it could work.
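The specificity worry above can be demonstrated with plain Julia dispatch (hypothetical function `f`, not part of the PR): a method on a concrete type always beats a `Union`-parameterized method, so a parameterized override would lose to any concrete-signature definition in Base.

```julia
# Julia picks the most specific applicable method; a concrete-type
# definition shadows a Union-parameterized one for that type.
f(x::T) where {T<:Union{Float32, Float64}} = :generic
f(x::Float32) = :specific

f(1.0f0)  # the concrete Float32 method wins
f(1.0)    # only the Union method applies
```

This is why per-type `@eval`-generated overrides are the safer default: each one is exactly as specific as a concrete Base method could be.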


codecov bot commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.31%. Comparing base (da38676) to head (97f7c1e).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #3016   +/-   ##
=======================================
  Coverage   89.31%   89.31%
=======================================
  Files         148      148
  Lines       12995    12995
=======================================
  Hits        11606    11606
  Misses       1389     1389
```

☔ View full report in Codecov by Sentry.

@github-actions github-actions bot left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: 97f7c1e | Previous: da38676 | Ratio |
|---|---|---|---|
| latency/precompile | 55584314156 ns | 55180480419 ns | 1.01 |
| latency/ttfp | 7717855921.5 ns | 7807018821.5 ns | 0.99 |
| latency/import | 4025438307 ns | 4140006046.5 ns | 0.97 |
| integration/volumerhs | 9623132 ns | 9623640.5 ns | 1.00 |
| integration/byval/slices=1 | 146664 ns | 147119 ns | 1.00 |
| integration/byval/slices=3 | 425554 ns | 426080.5 ns | 1.00 |
| integration/byval/reference | 144978 ns | 145207 ns | 1.00 |
| integration/byval/slices=2 | 286179 ns | 286491 ns | 1.00 |
| integration/cudadevrt | 103398 ns | 103866 ns | 1.00 |
| kernel/indexing | 14096 ns | 14460 ns | 0.97 |
| kernel/indexing_checked | 14790 ns | 15250 ns | 0.97 |
| kernel/occupancy | 679.2064516129033 ns | 680.6470588235294 ns | 1.00 |
| kernel/launch | 2083.2 ns | 2197.4444444444443 ns | 0.95 |
| kernel/rand | 15013 ns | 18784 ns | 0.80 |
| array/reverse/1d | 19771 ns | 20177 ns | 0.98 |
| array/reverse/2dL_inplace | 66721 ns | 67023 ns | 1.00 |
| array/reverse/1dL | 69914 ns | 70421 ns | 0.99 |
| array/reverse/2d | 21925 ns | 22600 ns | 0.97 |
| array/reverse/1d_inplace | 11577 ns | 10092 ns | 1.15 |
| array/reverse/2d_inplace | 13285 ns | 13669 ns | 0.97 |
| array/reverse/2dL | 74024 ns | 74807 ns | 0.99 |
| array/reverse/1dL_inplace | 66907 ns | 67197 ns | 1.00 |
| array/copy | 20556 ns | 20909 ns | 0.98 |
| array/iteration/findall/int | 157600.5 ns | 159075 ns | 0.99 |
| array/iteration/findall/bool | 139621 ns | 140579 ns | 0.99 |
| array/iteration/findfirst/int | 160826 ns | 161050 ns | 1.00 |
| array/iteration/findfirst/bool | 161498 ns | 162329.5 ns | 0.99 |
| array/iteration/scalar | 73528 ns | 74588 ns | 0.99 |
| array/iteration/logical | 213816 ns | 216756.5 ns | 0.99 |
| array/iteration/findmin/1d | 91481 ns | 96345.5 ns | 0.95 |
| array/iteration/findmin/2d | 121664 ns | 122694 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 42940 ns | 43621 ns | 0.98 |
| array/reductions/reduce/Int64/dims=1 | 50581 ns | 44622.5 ns | 1.13 |
| array/reductions/reduce/Int64/dims=2 | 61611 ns | 61757 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 89000 ns | 89064 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 88050 ns | 88156 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 36523.5 ns | 38304 ns | 0.95 |
| array/reductions/reduce/Float32/dims=1 | 42444 ns | 42098.5 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2 | 59795 ns | 60277 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 52444 ns | 52722 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2L | 71866 ns | 72438 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 43028 ns | 43667 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 44303 ns | 45026 ns | 0.98 |
| array/reductions/mapreduce/Int64/dims=2 | 61445 ns | 62167 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1L | 89047 ns | 89081 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 87991 ns | 88471 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 36026 ns | 38399 ns | 0.94 |
| array/reductions/mapreduce/Float32/dims=1 | 51970 ns | 41819.5 ns | 1.24 |
| array/reductions/mapreduce/Float32/dims=2 | 59669 ns | 60230.5 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1L | 52620 ns | 52828 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 72070 ns | 72045.5 ns | 1.00 |
| array/broadcast | 19836 ns | 20443 ns | 0.97 |
| array/copyto!/gpu_to_gpu | 12834 ns | 13022 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 214812 ns | 216732 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 287929 ns | 283462 ns | 1.02 |
| array/accumulate/Int64/1d | 124865 ns | 124912 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83675 ns | 84105 ns | 0.99 |
| array/accumulate/Int64/dims=2 | 158830 ns | 158348 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1710434 ns | 1710807 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966856.5 ns | 966629 ns | 1.00 |
| array/accumulate/Float32/1d | 108995 ns | 109358 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 80096 ns | 80805 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 147591 ns | 148060.5 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1619034 ns | 1619572.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 697841 ns | 698871 ns | 1.00 |
| array/construct | 1287.4 ns | 1243.1 ns | 1.04 |
| array/random/randn/Float32 | 46521.5 ns | 48790 ns | 0.95 |
| array/random/randn!/Float32 | 24942 ns | 25388 ns | 0.98 |
| array/random/rand!/Int64 | 27269 ns | 27419 ns | 0.99 |
| array/random/rand!/Float32 | 8736.333333333334 ns | 9072.666666666666 ns | 0.96 |
| array/random/rand/Int64 | 29927 ns | 29902 ns | 1.00 |
| array/random/rand/Float32 | 13355 ns | 13324 ns | 1.00 |
| array/permutedims/4d | 54940 ns | 57303.5 ns | 0.96 |
| array/permutedims/2d | 54110 ns | 54025.5 ns | 1.00 |
| array/permutedims/3d | 54939.5 ns | 55054.5 ns | 1.00 |
| array/sorting/1d | 2758522 ns | 2759246 ns | 1.00 |
| array/sorting/by | 3345836 ns | 3345927 ns | 1.00 |
| array/sorting/2d | 1080524 ns | 1082310 ns | 1.00 |
| cuda/synchronization/stream/auto | 1041.9 ns | 1045.9 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 7703.5 ns | 7338.6 ns | 1.05 |
| cuda/synchronization/stream/blocking | 836.9480519480519 ns | 812.9347826086956 ns | 1.03 |
| cuda/synchronization/context/auto | 1193.2 ns | 1170.7 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 7080.9 ns | 7720.5 ns | 0.92 |
| cuda/synchronization/context/blocking | 933.8857142857142 ns | 910.6818181818181 ns | 1.03 |

This comment was automatically generated by a workflow using github-action-benchmark.

@maleadt maleadt merged commit aa310ac into master Jan 13, 2026
3 checks passed
@maleadt maleadt deleted the tb/llvm18 branch January 13, 2026 06:50

Labels

bugfix This gets something working again.


Development

Successfully merging this pull request may close these issues.

PTX compile error: ".NaN requires .target sm_80 or higher" on Julia 1.12 (RTX 2080 / sm_75, works fine on Julia 1.11.7)

3 participants